Fix non-deterministic node pks causing duplicate nodes across runs by mscottford · Pull Request #17 · vladbatushkov/strazh

mscottford · 2026-04-13T15:52:20Z

Note: This PR depends on #13 (upgrade to .NET 10), #14 (fix transitive reference drop), #15 (parallelize build and analysis), and #16 (stream pipeline between stages).

Summary

Duplicate node bug: string.GetHashCode() is randomized per-process in .NET Core and later. Using it as the Neo4j pk meant every run produced a different value for the same logical node, defeating MERGE and silently creating duplicate nodes. The symptom was projects shared between two solutions (e.g. via a git submodule) appearing as two separate Project nodes with CONTAINS edges from each solution, rather than one shared node with edges from both.
Fix: Replaced GetHashCode() with MD5 in all three SetPrimaryKey overrides (Node, MethodNode, PackageNode). MD5 is deterministic across processes and runtimes; since this is content-addressing rather than security, it's the right tool.
Neo4j concurrency: Also bumps the SemaphoreSlim cap for concurrent Neo4j insert sessions from 4 to Environment.ProcessorCount, consistent with the same cap used for the build stage.

Migration note: Existing databases contain nodes with stale random-hash pks. Re-run the tool with --delete to clear and rebuild the graph with deterministic pks.

Test plan

Run dotnet test — existing regression test should pass
Run the tool twice against the same solution without --delete and confirm no duplicate nodes are created
Run against two solutions sharing projects (e.g. a solution with a submodule) and confirm shared projects appear as a single node with CONTAINS edges from both solutions

🤖 Generated with Claude Code

- Targets net10.0 (up from net9.0) - Buildalyzer/Workspaces 7.1.0 → 8.0.0 - Microsoft.CodeAnalysis/CSharp 4.13.0 → 5.3.0 (Analyzers pin removed) - Neo4j.Driver 5.27.0 → 6.0.0 - System.CommandLine 2.0.0-beta4 → 2.0.5 (first stable release) - Dockerfile base image updated to dotnet/sdk:10.0-alpine - Neo4j image updated to 2026.03.1 with corrected env var names - docker-compose.yml: removed obsolete version attribute, replaced hardcoded Windows volume path with relative ./SystemUnderTest path - Code updated for breaking API changes in System.CommandLine 2.0 (SetAction, Options.Add, Aliases.Add, Required, AllowMultipleArgumentsPerToken) and Neo4j.Driver 6.0 (IDriver.CloseAsync removed, use await using) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…from analysis When AddToWorkspace(workspace, addProjectReferences: true) is called for a project, Buildalyzer eagerly adds its referenced projects into the workspace. The previous deduplication check silently dropped those projects on their own loop iteration, causing them to receive no CONTAINS triple and no code analysis. Adds a regression test using a two-project fixture solution where ProjectA references ProjectB — ensuring both appear in the analysis context regardless of workspace insertion order. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Stage 1: Build projects in parallel using .AsParallel() in GetAnalysisContext; each p.Build() spawns an isolated MSBuild process so there is no shared state. AdhocWorkspace mutations remain sequential (not thread-safe). - Stage 3: Analyze and insert all projects concurrently using Task.WhenAll with Task.Run so CPU-bound syntax tree walking runs across thread-pool threads. The delete step is extracted to DbManager.DeleteData and run once upfront before the parallel loop to avoid races. - Stage 4: A SemaphoreSlim(4) gates concurrent Neo4j sessions to avoid exhausting the connection pool on large solutions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The workspace population step (AddToWorkspace) is the slowest sequential phase on large solutions and previously emitted no output, causing a silent pause between "Building projects - finished." and "done." Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…l to complete Previously the pipeline had two hard barriers: all projects had to finish building before workspace loading began, and all projects had to finish loading before analysis began. The Build stage now launches all MSBuild design-time builds concurrently via Task.WhenEach. As each build completes it flows immediately into the Load stage (sequential AdhocWorkspace mutation) and then yields a (Project, IAnalyzerResult) pair to the caller. The Analyze and Insert stages pick up each pair as it arrives via IAsyncEnumerable, so analysis of early- finishing projects begins while later projects are still building and loading. The solution-order sort is removed — AddToWorkspace(addProjectReferences:true) handles any insertion order correctly via the transitive-reference path introduced in the previous fix. GetAnalysisContext is updated to async Task<AnalysisContext> to consume the stream; the regression test is updated accordingly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AddToWorkspace(addProjectReferences: true) calls analyzer.Build() internally for any referenced project not yet in the workspace. When Build and Load were interleaved via Task.WhenEach, in-flight build tasks raced with AddToWorkspace's internal builds for the same project, causing Buildalyzer's environment detection to fail with InvalidOperationException: Could not find build environment. Fix: wait for all builds to complete (Task.WhenAll) before starting the Load stage. Build → Load streaming is lost, but Load → Analyze streaming is retained: each project is yielded as soon as it is loaded, so analysis begins immediately without waiting for all projects to finish loading. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Previously all projects were built in parallel with no limit, spawning one dotnet process per project simultaneously. A SemaphoreSlim(Environment.ProcessorCount) now gates the Build stage so at most one build per logical core runs at a time. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

string.GetHashCode() is randomized per-process in .NET Core and later (hash randomization was introduced to prevent hash-flooding DoS attacks). Using it as the Neo4j pk meant every run produced a different pk for the same logical node, defeating MERGE and silently creating duplicate nodes instead of updating existing ones. Symptom: projects shared between two solutions (e.g. via a git submodule) appeared as two separate Project nodes in the graph rather than one node with CONTAINS edges from both Solution nodes. Fix: replace GetHashCode() with MD5 in all SetPrimaryKey overrides. MD5 is stable across processes and runtimes and is appropriate here since this is a content-addressing use case, not a security one. Note: existing databases will have stale pks. Re-analyze with --delete to clear and rebuild the graph with deterministic pks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mscottford and others added 11 commits April 10, 2026 13:57

Adds .idea to .gitignore

4b13967

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Refactors domain nodes and triples to use C# primary constructors

5d4e672

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Increases concurrent Neo4j connections to match processor count

03a7590

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix non-deterministic node pks causing duplicate nodes across runs#17

Fix non-deterministic node pks causing duplicate nodes across runs#17
mscottford wants to merge 11 commits intovladbatushkov:mainfrom
mscottford:fix-non-deterministic-node-pks

mscottford commented Apr 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mscottford commented Apr 13, 2026

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant